Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
Author info: Pengfei Zuo (Huawei) and Shanghai Jiao Tong University
Abstract:
Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines executing multi-turn conversations are inefficient due to the need to repeatedly compute the key-value (KV) caches of historical tokens, incurring high serving costs.【existing multi-turn serving repeatedly recomputes the KV cache】 To address the problem, this paper proposes CachedAttention, a new attention mechanism that enables reuse of KV caches across multi-turn conversations, significantly reducing the repetitive computation overheads. CachedAttention maintains a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests. To reduce KV cache access overheads from slow mediums, CachedAttention employs layer-wise pre-loading and asynchronous saving schemes to overlap the KV cache access with the GPU computation. To ensure that the KV caches to be accessed are placed in the fastest hierarchy, CachedAttention employs scheduler-aware fetching and eviction schemes to consciously place the KV caches in different layers based on the hints from the inference job scheduler. To avoid the invalidation of the saved KV caches incurred by context window overflow, CachedAttention enables the saved KV caches to remain valid via decoupling the positional encoding and effectively truncating the KV caches. Extensive experimental results demonstrate that CachedAttention significantly decreases the time to the first token (TTFT) by up to 87%, improves the prompt prefilling throughput by up to 7.8× for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%.
An efficient KV caching system for multi-turn conversations involves the following challenges and corresponding schemes:
- Communication between the additional storage tiers and the GPU is slow.
  - Layer-wise pre-loading scheme
  - Asynchronous saving scheme
- KV caches occupy a very large amount of storage space.
  - Use host memory and disks as additional tiers (see the tiering sketch after this list)
- Relying on the additional tiers means most KV caches end up on disks, which are slow to access.
  - Scheduler-aware KV cache fetching scheme
- When the conversation exceeds the model's context window, the earlier text must be truncated; the positional encodings then change, which would invalidate all previously saved KV caches.
  - Decouple the KV caches from the positional encoding embeddings
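To make the hierarchy concrete, below is a minimal sketch (not the paper's implementation) of a three-tier KV cache store that keeps active sessions in GPU HBM, spills warm sessions to pinned host memory, and pushes cold sessions to disk. The class name, slot-based capacity accounting, and file layout are all assumptions made for this example.

```python
import torch

class TieredKVStore:
    """Illustrative three-tier store: GPU HBM -> pinned host memory -> disk."""

    def __init__(self, hbm_slots: int, host_slots: int, disk_dir: str):
        self.hbm = {}            # session_id -> KV tensor resident on the GPU
        self.host = {}           # session_id -> pinned CPU copy of the KV tensor
        self.hbm_slots = hbm_slots
        self.host_slots = host_slots
        self.disk_dir = disk_dir

    def save(self, session_id: str, kv: torch.Tensor) -> None:
        """Save one conversation's KV cache, spilling to lower tiers when full."""
        if len(self.hbm) < self.hbm_slots:
            self.hbm[session_id] = kv
        elif len(self.host) < self.host_slots:
            buf = torch.empty(kv.shape, dtype=kv.dtype, pin_memory=True)
            buf.copy_(kv, non_blocking=True)   # asynchronous device-to-host copy
            self.host[session_id] = buf
        else:
            torch.save(kv.cpu(), f"{self.disk_dir}/{session_id}.pt")

    def load(self, session_id: str) -> torch.Tensor:
        """Bring a session's KV cache back to the GPU for its next turn."""
        if session_id in self.hbm:
            return self.hbm[session_id]
        if session_id in self.host:
            return self.host[session_id].to("cuda", non_blocking=True)
        return torch.load(f"{self.disk_dir}/{session_id}.pt").to("cuda")
```

In a real serving system the capacity accounting would be in bytes rather than slots and disk I/O would be scheduled, but the ordering of the tiers is the point of the sketch.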
Overall Architecture
Layer-wise pre-loading strategy: while the GPU computes attention for layer i, the saved KV cache of layer i+1 is prefetched from host memory, so transfers overlap with computation.
Potential issue: if prefetching a layer takes longer than computing it, bubbles remain in the pipeline. The remedy is to reserve a buffer and start prefetching earlier, ahead of the first layer's computation, as in the sketch below.
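A minimal sketch of the layer-wise pre-loading idea, assuming each layer's saved KV cache sits in a pinned host-memory buffer: while layer i is being computed on the default stream, layer i+1's KV cache is copied host-to-device on a separate CUDA stream. The function and variable names (`compute` via `layers[i]`, `host_kv`) are placeholders, not the paper's API.

```python
import torch

copy_stream = torch.cuda.Stream()

def prefill_with_preloading(layers, host_kv, hidden):
    """host_kv[i]: pinned CPU tensor holding layer i's saved KV cache."""
    num_layers = len(layers)
    device_kv = [None] * num_layers
    ready = [torch.cuda.Event() for _ in range(num_layers)]

    # Kick off the copy for layer 0 before any computation starts; this acts
    # as the read buffer that hides the first transfer's latency.
    with torch.cuda.stream(copy_stream):
        device_kv[0] = host_kv[0].to("cuda", non_blocking=True)
        ready[0].record(copy_stream)

    for i in range(num_layers):
        # Start pre-loading the next layer's KV cache in the background.
        if i + 1 < num_layers:
            with torch.cuda.stream(copy_stream):
                device_kv[i + 1] = host_kv[i + 1].to("cuda", non_blocking=True)
                ready[i + 1].record(copy_stream)
        # Make sure layer i's KV cache has arrived, then compute layer i.
        torch.cuda.current_stream().wait_event(ready[i])
        hidden = layers[i](hidden, past_kv=device_kv[i])
    return hidden
```

The read-ahead before the loop is what the buffer mentioned above buys: the deeper the read-ahead, the more copy latency can be hidden when transfers are slower than per-layer computation.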
Asynchronous Saving
Saving the last layer's KV cache produced in the prefill phase can be overlapped by the decoding phase, so the write-back does not stall the GPU.
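A minimal sketch of asynchronous saving, assuming a dedicated CUDA stream and preallocated pinned host buffers: as soon as a layer's KV cache is produced, its device-to-host copy is issued on `save_stream` and overlaps with the computation of the next layer (the last prefill layer's copy overlaps with the first decoding steps). Names are placeholders, not the paper's API.

```python
import torch

save_stream = torch.cuda.Stream()

def save_layer_kv_async(kv_gpu, host_buffers, layer_idx):
    """Issue a non-blocking copy of one layer's KV cache to pinned host memory."""
    # Wait until the compute stream has finished producing this layer's KV.
    save_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(save_stream):
        host_buffers[layer_idx].copy_(kv_gpu, non_blocking=True)

def finish_saving():
    """Call once per request before the host buffers are reused elsewhere."""
    save_stream.synchronize()
```

Because the copies run on their own stream, saving never blocks the compute stream; only the final synchronization point waits for outstanding write-backs.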
Scheduler-aware Fetching and Eviction
Eviction: KV caches of jobs inside the scheduler's window are preferentially exempted; if something inside the window must be evicted, the job scheduled last in the window is evicted first.
Open question from these notes: if the load is not that high, wouldn't this scheduler-aware policy become ineffective, since nothing needs to be evicted?
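A minimal sketch of the eviction preference described above, assuming `window` is the ordered list of session ids the job scheduler plans to run next and `cached` is the set of sessions held in the tier that needs space. This is an illustration, not the paper's code.

```python
def pick_eviction_victim(cached: set[str], window: list[str]) -> str:
    """Choose one session's KV cache to evict from a full tier."""
    in_window = [s for s in window if s in cached]
    out_of_window = cached - set(in_window)
    if out_of_window:
        # Prefer evicting sessions the scheduler will not touch soon
        # (any ordering such as LRU could rank them; pick arbitrarily here).
        return next(iter(out_of_window))
    # Only sessions inside the scheduling window remain: evict the one that
    # runs last, since it has the most time to be fetched back before use.
    return in_window[-1]
```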
Decoupled KV Cache Truncation
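Per the abstract and the challenge list above, the idea here is to keep the positional encoding out of the saved KV caches so that truncating old tokens does not invalidate what remains. Below is a minimal sketch for a RoPE-style model, assuming keys are cached before the rotary embedding is applied and re-rotated with fresh positions after truncation; `apply_rope` and the interleaved pairing are generic illustrations, not the paper's exact kernels.

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    """Standard rotary position embedding; x: [seq, dim], positions: [seq]."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]      # [seq, dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def truncate_and_reuse(cached_keys_no_pos: torch.Tensor, drop: int) -> torch.Tensor:
    """Drop the oldest `drop` tokens and re-apply RoPE with new positions."""
    kept = cached_keys_no_pos[drop:]             # keys were cached without RoPE
    new_positions = torch.arange(kept.shape[0])  # positions restart from 0
    return apply_rope(kept, new_positions)
```

Because the stored keys carry no positional information, the surviving cache stays valid after truncation; only the cheap re-rotation is repeated, not the full prefill.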
Experiments
Ablation study setup: 1× A100, Llama-13B, batch size 16, sequence length 1000.